Dynamic Control Flow in Large-Scale Machine Learning
Many recent machine learning models rely on fine-grained dynamic control flow
for training and inference. In particular, models based on recurrent neural
networks and on reinforcement learning depend on recurrence relations,
data-dependent conditional execution, and other features that call for dynamic
control flow. These applications benefit from the ability to make rapid
control-flow decisions across a set of computing devices in a distributed
system. For performance, scalability, and expressiveness, a machine learning
system must support dynamic control flow in distributed and heterogeneous
environments.
This paper presents a programming model for distributed machine learning that
supports dynamic control flow. We describe the design of the programming model,
and its implementation in TensorFlow, a distributed machine learning system.
Our approach extends the use of dataflow graphs to represent machine learning
models, offering several distinctive features. First, the branches of
conditionals and bodies of loops can be partitioned across many machines to run
on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs.
Second, programs written in our model support automatic differentiation and
distributed gradient computations, which are necessary for training machine
learning models that use control flow. Third, our choice of non-strict
semantics enables multiple loop iterations to execute in parallel across
machines, and to overlap compute and I/O operations.
We have done our work in the context of TensorFlow, and it has been used
extensively in research and production. We evaluate it using several real-world
applications, and demonstrate its performance and scalability.
Comment: Appeared in EuroSys 2018. 14 pages, 16 figures.
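As a concrete illustration, TensorFlow exposes this in-graph control flow through operators such as tf.cond and tf.while_loop. The snippet below is a minimal sketch of the programming model (the constants and branch logic are invented for illustration); because the loop is part of the dataflow graph rather than the host program, the runtime can partition it across devices and differentiate through it.

    # Minimal sketch: data-dependent control flow expressed inside the
    # dataflow graph itself rather than in the host Python program.
    import tensorflow as tf

    def body(i, x):
        # A data-dependent branch inside the loop body: halve even values,
        # otherwise apply 3x + 1 (toy logic, invented for illustration).
        x = tf.cond(tf.equal(x % 2, 0),
                    lambda: x // 2,
                    lambda: 3 * x + 1)
        return i + 1, x

    # Run ten iterations; in-graph loops let the runtime overlap and
    # distribute iterations across heterogeneous devices.
    i, x = tf.while_loop(lambda i, x: i < 10, body,
                         [tf.constant(0), tf.constant(7)])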
Bigtable: A Distributed Storage System for Structured Data
Bigtable is a distributed storage system that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
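To make the data model concrete: the paper describes Bigtable as a sparse, sorted, multi-versioned map indexed by row key, column key, and timestamp. The toy in-memory class below sketches just that abstraction (the class and method names are ours, and it ignores sharding, persistence, and everything else the real system does):

    import time

    class ToyBigtable:
        """Toy in-memory sketch of Bigtable's data model only: a sorted,
        multi-versioned map from (row key, column key, timestamp) to an
        uninterpreted byte string. The real system shards rows into
        tablets across thousands of servers; none of that is shown."""

        def __init__(self):
            # (row, column) -> list of (timestamp, value), newest first
            self._cells = {}

        def put(self, row, column, value, ts=None):
            ts = time.time() if ts is None else ts
            versions = self._cells.setdefault((row, column), [])
            versions.append((ts, value))
            versions.sort(reverse=True)  # keep newest version first

        def get(self, row, column, ts=None):
            # Return the newest version at or before ts (latest if ts is None).
            for t, v in self._cells.get((row, column), []):
                if ts is None or t <= ts:
                    return v
            return None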
BigDL: A Distributed Deep Learning Framework for Big Data
This paper presents BigDL (a distributed deep learning framework for Apache
Spark), which has been used by a variety of users in the industry for building
deep learning applications on production big data platforms. It allows deep
learning applications to run on the Apache Hadoop/Spark cluster so as to
directly process the production data, and as a part of the end-to-end data
analysis pipeline for deployment and management. Unlike existing deep learning
frameworks, BigDL implements distributed, data parallel training directly on
top of the functional compute model (with copy-on-write and coarse-grained
operations) of Spark. We also share real-world experience and "war stories" of
users that have adopted BigDL to address their challenges (i.e., how to easily
build end-to-end data analysis and deep learning pipelines for their production
data).
Comment: In ACM Symposium on Cloud Computing (SoCC) 2019.
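The abstract's central point, training built directly on Spark's functional, coarse-grained operations, can be sketched in a few lines. The snippet below is not BigDL's API; it is a hypothetical PySpark illustration of one synchronous data-parallel step (the grad function, the data, and all names are invented):

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext(appName="toy-data-parallel-step")

    def grad(weights, batch):
        # Placeholder gradient of squared error for a linear model; stands
        # in for a real deep-learning backward pass.
        X, y = batch
        return X.T @ (X @ weights - y) / len(y)

    # Eight invented mini-batches for a 10-feature model.
    weights = np.zeros(10)
    batches = sc.parallelize([(np.random.randn(32, 10), np.random.randn(32))
                              for _ in range(8)]).cache()

    # One synchronous step: broadcast the current weights, compute local
    # gradients with a coarse-grained map, then aggregate on the driver.
    bw = sc.broadcast(weights)
    total = batches.map(lambda b: grad(bw.value, b)).treeReduce(lambda a, b: a + b)
    weights -= 0.1 * total / batches.count()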
Spanner: Google’s Globally-Distributed Database
Spanner is Google's scalable, multi-version, globally distributed, synchronously replicated database. It is the first system to distribute data at global scale and support externally consistent distributed transactions. This paper describes Spanner's architecture, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads, lock-free read-only transactions, and atomic schema changes.
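The time API mentioned above is TrueTime, whose calls return intervals rather than point timestamps. The sketch below is a toy single-machine rendering of that interface (the fixed EPSILON is our assumption; the real system derives its uncertainty bound from GPS and atomic-clock time masters):

    import time
    from dataclasses import dataclass

    EPSILON = 0.007  # assumed uncertainty bound in seconds, for illustration

    @dataclass
    class TTInterval:
        earliest: float
        latest: float

    def tt_now() -> TTInterval:
        # An interval guaranteed to contain the true absolute time.
        t = time.time()
        return TTInterval(t - EPSILON, t + EPSILON)

    def tt_after(t: float) -> bool:
        # True only if t has definitely passed.
        return tt_now().earliest > t

    def tt_before(t: float) -> bool:
        # True only if t has definitely not yet arrived.
        return tt_now().latest < t

A transaction assigned timestamp s is made visible only once tt_after(s) holds; this commit-wait rule is how interval clocks yield external consistency.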
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance
across a variety of natural language tasks using few-shot learning, which
drastically reduces the number of task-specific training examples needed to
adapt the model to a particular application. To further our understanding of
the impact of scale on few-shot learning, we trained a 540-billion parameter,
densely activated, Transformer language model, which we call the Pathways
Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML
system which enables highly efficient training across multiple TPU Pods. We
demonstrate continued benefits of scaling by achieving state-of-the-art
few-shot learning results on hundreds of language understanding and generation
benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough
performance, outperforming the finetuned state-of-the-art on a suite of
multi-step reasoning tasks, and outperforming average human performance on the
recently released BIG-bench benchmark. A significant number of BIG-bench tasks
showed discontinuous improvements from model scale, meaning that performance
steeply increased as we scaled to our largest model. PaLM also has strong
capabilities in multilingual tasks and source code generation, which we
demonstrate on a wide array of benchmarks. We additionally provide a
comprehensive analysis on bias and toxicity, and study the extent of training
data memorization with respect to model scale. Finally, we discuss the ethical
considerations related to large language models and discuss potential
mitigation strategies.
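For readers unfamiliar with few-shot learning as used here: no gradient updates are made; a handful of exemplars are simply placed in the prompt. The snippet below is a schematic illustration only (the exemplars and task are invented):

    # Schematic few-shot prompt: invented exemplars are prepended to the
    # query, and the model is asked to continue the text; no task-specific
    # training examples beyond these are needed.
    exemplars = [
        ("Review: The plot dragged badly.", "Sentiment: negative"),
        ("Review: A warm, funny surprise.", "Sentiment: positive"),
    ]
    query = "Review: Forgettable but harmless."

    prompt = "\n\n".join(f"{x}\n{y}" for x, y in exemplars)
    prompt += f"\n\n{query}\nSentiment:"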
Main memory logs.
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (p. 113-117). By Sanjay Ghemawat.
Disk Management for Object-Oriented Databases
This paper proposes efficient disk management techniques for object-oriented
databases …